AUTHOR'S NOTE


The present work is a revised version of my previous 
Technical Memorandum (TM), "Information Theory and the 
Earth's Density Distribution," NASA TM-78088, dated 
February 1978. There are three reasons for publishing 
this revision. 

First, I included a discussion of Rietsch (1977). I was
unaware of his important pioneering paper until I was kindly
advised of it by the editor and referees of the Geophysical
Journal.

Second, I expanded the discussion of certain points 
(such as the nature of probability) which were only briefly 
mentioned in the original TM. 

Third and last, I presented the new material on
Shannon's information measure for continuous probability
distributions.

These reasons, I feel, are more than sufficient for 
producing a revision of the earlier work. 


INFORMATION THEORY AND THE EARTH'S 
DENSITY DISTRIBUTION 

David Parry Rubincam 
NAS-NRC Resident Research Associate 

ABSTRACT 

The "most likely" density distribution inside the 
earth is derived from Jaynes's (1957) information theory 
approach. The earth is assumed to be spherical and the 
density distribution spherically symmetric. The known 
mass and moment of inertia are used as constraints on the 
density distribution. The partitioning of particles among 
cubical boxes and use of the grand canonical ensemble from 
statistical mechanics result in a density distribution of 
the form $\rho(r) = 12.30 \exp(-1.46\, r^2/a_E^2)$ g/cm$^3$, where
$a_E$ is the radius of the earth. This differs from the density
distribution derived by Rietsch (1977), who also used 
the information theory approach. The difference results 
from Rietsch allowing the density to vary continuously 
inside the volume elements rather than in discrete steps 
as done here. Some criticisms of information theory 
inference are discussed. In particular, Shannon's (1948) 
generalization of the information measure to continuous 
probability distributions is defended as the more useful 
measure in the continuous case over the Kullback measure. 
Future directions for information theory inference in solid 
earth geophysics are indicated. 


1. INTRODUCTION 

In a recent paper, Rietsch (1977) introduced Jaynes's (1957) 
information theory approach to inverse problems in solid earth 
geophysics. The information theory approach is a method of
scientific inference which has had great success in statistical
mechanics (see, e.g., Jaynes 1957, 1963; Tribus 1961; Katz 1967;
and Baierlein 1971) and in spectral analysis (e.g., Burg 1967,
1968, 1972; Smylie et al. 1973; and Graber 1976). Rietsch (1977)
applied the approach to two problems. The first problem dealt with 
spectral analysis; I will not discuss it at all here. The second 
dealt with inferring the density distribution for the earth from 
knowledge of its mass and moment of inertia; the earth is assumed 
to be spherical and the density distribution spherically symmetric. 

I have also applied information theory inference to the very 
same problem of the earth's density distribution. My approach, 
however, is somewhat different from Rietsch's (1977), and
consequently so is the density distribution. In this paper I present
these results together with a discussion of the differences be- 
tween the two approaches, and some general comments on information 
theory not discussed by Rietsch (1977). 

2. THE INFERENCE PROBLEM 

This is the nature of the problem: we desire to know what
the density distribution $\rho(r)$ is as a function of radial
distance $r$ from the center of the earth, but the only
information we have about the earth is its mass $M_E$ and moment
of inertia $C_E$, both of which depend upon $\rho(r)$. Clearly we
do not have enough information to say what $\rho(r)$ actually is.
Any proposed distribution which satisfies the mass and moment of
inertia is nonunique; there are infinitely many other
distributions which also satisfy the given data. Hence we cannot
invert the data; we must infer an answer from incomplete data.

There are several methods for dealing with this problem.
(For a general discussion see Bullen 1975, pp. 60-64.) The
approach of Backus and Gilbert (1967, 1968) is to study all
solutions consistent with the given data; this is called the
geophysical inverse problem. The Backus-Gilbert approach has been
used extensively. See, for example, Gilbert et al. (1973); Parker
(1977a, 1977b); Jordan and Franklin (1971); and references cited
by Parker (1977a, 1977b), Richards (1975), Anderson (1975), and
Engdahl et al. (1975). A quite different method is that of Press
(1968a, 1968b), who adopted a Monte Carlo technique of testing a
wide range of models against the data and retaining only those
which agreed with it. However, the commonest method by far is
that of modeling: by introducing certain assumptions in addition
to the data, the answer becomes unique. The assumptions of the
Adams-Williamson equation and uniform chemical composition, for
instance, plus the known mass, seismic velocities, and surface
density determine a unique density distribution (Alterman et al.
1959, pp. 80-81). Of course a difficulty with modeling is that
the assumed conditions may not hold.

Information theory inference approaches the problem from the
following viewpoint: we cannot reject any possible answer (in
our case, density distribution) which agrees with the known data.
We do feel, however, that some answers are more likely than others.
So we do the following: assign each possible answer a probability
that it is the correct answer, then apportion the magnitudes of the
probabilities in accordance with the data we have on hand. For
this purpose we need to define the word "probability".

3. PROBABILITY 

There are two major schools of thought on the nature of 
probability (Howard 1968, pp. 211-212). Presently, the majority 
of probability theory users hold that a probability is an
objective quantity. A coin, for example, has a certain probability
of falling heads, just as it has mass and angular velocity. The
way to measure the probability is to flip the coin a large number
of times and note the frequency of occurrence of heads.

While this is the traditional view of probability, it has 
the requirement of repeatability. If, for example, we discuss the 
probability of a successful launching of the next space station, 
then the objective view is of no use. The next launching is a 
one-of-a-kind affair, unlike the flip of a coin. The same is true 
of our topic: there is only one earth with one density
distribution. There is no "ensemble of earths"!

This difficulty leads to the second, more powerful view of
probability: subjective probability, used in information theory
inference and decision theory. The subjective view holds that a
probability reflects our state of knowledge about phenomena,
rather than a property of the phenomena themselves (Howard 1968,
p. 211). We would assign equal probabilities for a coin falling
heads or tails, for instance, if we have no information which
would cause us to prefer one outcome over the other. Hence a
probability represents our "degree of rational belief" (Baierlein
1971, p. 13) that a particular outcome will occur. The outcome
need not be repeatable. The probabilities are subjective in the
sense that they depend on a state of knowledge, and one person's
data may differ from another's.

The subjective view of probability has been around for quite
some time. It was held by Bayes and Laplace, and quantitative
treatments have been given by John Maynard Keynes, Harold Jeffreys,
John G. Kemeny, and Rudolf Carnap (Cox 1961). Only recently,
however, has this view gained many adherents. One reason for this
situation is that it ran counter to the prevailing dogma of the
objective view, as discussed by Jaynes (1967). Another reason was
the lack of a cogent set of axioms from the subjective probability
theorists from which to derive the probability calculus. This
defect was remedied by Cox (1946, 1961), whose axioms are so
simple yet compelling that they lead to the usual laws of
probability without the introduction of ensembles or frequencies.
This, plus the introduction of information theory by Shannon
(1948), has caused the numbers of subjectivists to wax and
objectivists to wane (Howard 1968, p. 212).

The probability calculus alone does not tell us how to assign
probabilities; it only gives us rules for operating with them.
What we need is a way of computing magnitudes of probabilities
consistent with given data. This is where Jaynes's (1957)
information theory approach comes in. (Baierlein 1971 has an
excellent general discussion of the information theory approach.)

4. JAYNES'S PRINCIPLE OF MINIMUM PREJUDICE 

4.1 SHANNON'S INFORMATION MEASURE 

At the heart of the approach is Shannon's (1948) information
measure

$$MI(P_1, P_2, \ldots, P_N) = -K \sum_{i=1}^{N} P_i \ln P_i \tag{4.1}$$

Here $P_i$ is the probability that the ith of N possible answers is
the correct answer and K is a positive constant. This function
was originally termed the entropy function (Tribus and McIrvine
1970, p. 180), due to its similarity to thermodynamic entropy.
For this reason the information theory approach is often called
the Maximum Entropy Method, or MEM for short. The relationship
between the information measure and thermodynamic entropy is deep,
but the two are not identical (Baierlein 1971, pp. 473-478). To
avoid confusion I will follow Baierlein (1971, p. 64) and call
Shannon's information measure $MI(P_1, P_2, \ldots, P_N)$, where MI
stands for Missing Information, or the amount of information
needed to determine which answer is correct. A better term for
the approach would be ITI, or Information Theory Inference, rather
than MEM.

MI is not dimensionless (Edmundson, private communication,
1976), a fact that does not appear to be explicitly noted by Katz
(1967) or Baierlein (1971). It carries units of information. For
example, if we change the base of the logarithm in (4.1) from e
to 2, which changes K to a new constant K', and set K' = 1, then
$MI = -\sum_i P_i \log_2 P_i$ and MI is measured in bits. In the
following development I will retain the natural logarithm base
and set K = 1, so that MI is measured in nats (from natural
digits; McEliece 1977, p. 15). I will also suppress the units, but
it should be remembered that MI is not a dimensionless quantity.

The importance of the MI function is its uniqueness; that is,
given certain very reasonable assumptions of how the MI function
should behave, one is inevitably led to (4.1). In this respect it
is like Cox's (1946, 1961) derivation of the probability calculus.
I will not state the assumptions or prove that they lead to
MI. The assumptions are given by Rietsch (1977, p. 491), and
proofs are supplied by Shannon (1948, pp. 419-420) and Baierlein
(1971, pp. 64-74). Rather, I will merely indicate its plausibility
with an example. But first we note from (4.1) that $MI \geq 0$; the
amount of information needed to single out the correct answer is
never negative. This is certainly an intuitively desirable
property. Now let us suppose that all of the probabilities are
equal. In this case it can be shown that MI attains its maximum
value, which is ln N when K = 1. This accords with intuition: we
are surely in a state of maximum ignorance (i.e., need the most
information) if we can favor no answer above another in terms of
probability. Suppose now we have discovered that the jth
possibility is the correct answer. Then $P_j = 1$ and $P_i = 0$ for
$i \neq j$. How much information is missing now? In this case
$P_j \ln P_j = 1 \ln 1 = 0$, and $P_i \ln P_i = 0$ for $i \neq j$ (by
virtue of $\lim_{x \to 0} x \ln x = 0$). Thus MI = 0; no information
is missing; we have the answer. This also accords with intuition.
Normally our ignorance lies between these two extremes, and MI
takes on values accordingly between its maximum and 0. Hence MI
is a plausible measure of missing information.
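
The limiting cases just described can be checked numerically. The
following is a minimal Python sketch, not part of the original
development; the function name missing_information and the example
probabilities are assumptions introduced purely for illustration.
It evaluates (4.1) with K = 1 and treats terms with $P_i = 0$ as
contributing nothing, in line with the limit quoted above.

```python
import math

def missing_information(probs, base=math.e):
    """Shannon's measure (4.1) with K = 1; terms with P_i = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

# Hypothetical probability assignments over N = 4 possible answers.
uniform = [0.25] * 4             # no answer favored: maximum ignorance
skewed = [0.7, 0.1, 0.1, 0.1]    # some answers favored over others
certain = [0.0, 1.0, 0.0, 0.0]   # the correct answer is known

mi_uniform = missing_information(uniform)                # in nats
mi_skewed = missing_information(skewed)                  # in nats
mi_certain = missing_information(certain)                # in nats
mi_uniform_bits = missing_information(uniform, base=2)   # in bits
```

Under these assumptions the uniform assignment gives ln 4 nats
(equivalently 2 bits), the skewed assignment gives less, and the
certain assignment gives zero.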


4.2 JAYNES'S PRINCIPLE OF MINIMUM PREJUDICE 


The essence of the information theory approach is this:
choose the probabilities $P_1, P_2, \ldots, P_N$ of the possible
outcomes to make MI as large as possible, subject to the
constraints of the known data. This is Jaynes's principle of
minimum prejudice (Tribus and Rossi 1973). The information theory
approach is therefore a rational method for assigning
probabilities. In statistical mechanics, this procedure is
equivalent to maximizing the entropy (Morse 1969).

To illustrate the technique with an example, suppose that
we do not know the mass of the earth exactly, but do know that it
must be chosen from the values $M_1, M_2, \ldots, M_N$. Aside from
$\sum_i P_i = 1$, this is all we know. We must find $P_j$, the
probability that $M_j$ is the correct mass, by maximizing MI. This
is done by taking the partial derivative of

$$-\sum_{i=1}^{N} P_i \ln P_i + \alpha_0 \sum_{i=1}^{N} P_i$$

with respect to each $P_i$ and setting it equal to zero. The
$\alpha_0$ is a Lagrange multiplier which ensures that all of the
probabilities add up to 1. Carrying out the process yields

$$-\ln P_i - 1 + \alpha_0 = 0$$

or

$$P_i = e^{\alpha_0 - 1} = \text{constant}$$

The unknown $\alpha_0$ may be found from the constraint

$$\sum_{i=1}^{N} P_i = 1$$

giving

$$P_i = 1/N$$

All of the probabilities are equal. We know nothing about the
various $M_i$ and therefore cannot favor one particular value over
another.

Now suppose we obtain further information; e.g., we learn
that the expectation value of the mass is

$$\sum_{i=1}^{N} P_i M_i = M_E$$

We then reassign probabilities in accordance with Jaynes's
principle:

$$\frac{\partial}{\partial P_i}\left[-\sum_i P_i \ln P_i + \alpha_0 \sum_i P_i + \alpha_1 \sum_i P_i M_i\right] = 0, \qquad i = 1, 2, \ldots, N$$

giving an exponential function in $M_i$:

$$P_i = e^{\alpha_0 - 1 + \alpha_1 M_i}$$

where $\alpha_0$ and $\alpha_1$ are Lagrange multipliers to be found
from the constraints

$$\sum_i P_i = 1, \qquad \sum_i P_i M_i = M_E$$

Note that our method is completely analogous to that of the
canonical ensemble in statistical mechanics (Morse 1969, pp.
268-269). Indeed, the mathematics is identical. The only
difference is in the philosophical basis, which indicates that
the method has broad applicability and is not confined to
statistical mechanics (Jaynes 1963, p. 192).
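
To make the role of the multipliers concrete, the following Python
sketch solves the mean-value example numerically. It is only an
illustration under stated assumptions: the function name
maxent_probabilities, the candidate values, the target means, and
the bisection bracket are all hypothetical and do not come from the
text. The normalization multiplier is absorbed by dividing by the
sum of the weights, and the remaining multiplier (called alpha1 in
the code) is found by bisection, since the mean increases
monotonically with it.

```python
import math

def maxent_probabilities(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Return (alpha1, P) with P_i proportional to exp(alpha1 * M_i)
    and sum_i P_i M_i equal to target_mean (to within tol in alpha1)."""
    if not min(values) < target_mean < max(values):
        raise ValueError("target mean must lie strictly inside the value range")

    def probs(alpha1):
        # Shift the exponent by its maximum to avoid overflow.
        shift = max(alpha1 * v for v in values)
        weights = [math.exp(alpha1 * v - shift) for v in values]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(alpha1):
        return sum(p * v for p, v in zip(probs(alpha1), values))

    # Bisection on alpha1; assumes the solution lies inside [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    alpha1 = 0.5 * (lo + hi)
    return alpha1, probs(alpha1)

# Hypothetical candidate masses in arbitrary units.
candidate_masses = [5.0, 6.0, 7.0]
alpha1, p = maxent_probabilities(candidate_masses, target_mean=5.8)
alpha1_flat, p_flat = maxent_probabilities(candidate_masses, target_mean=6.0)
```

When the target mean equals the plain average of the candidates,
the multiplier vanishes and the equal-probability result of the
first example is recovered; a lower target mean gives a negative
multiplier and probabilities that decrease with mass.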

Maximizing the MI function is obviously the key point in the 
information theory approach; it provides us with the magnitudes 
of the probabilities. Hence justification for this approach is 
necessary. 

The justification goes like this: MI is the unique measure
for determining the amount of information needed to single out
the correct answer. Any method for assigning probabilities which
does not maximize MI under known constraints (knowledge) tacitly
assumes information it hasn't got! In other words, if someone
assigns probabilities not in accordance with Jaynes's principle,
that person is prejudicing the probabilities without foundation
in the known data. Thus is derived the name, "principle of
minimum prejudice".

This point is particularly clear in our last example, where
we knew one of the $M_i$ was the correct answer for the mass of the
earth, but had no other information (other than $\sum_i P_i = 1$).
In this case, Jaynes's principle assigned equal probabilities to
all outcomes. We were completely ignorant as to which answer was
correct. If someone used some other principle, and assigned, for
example, a larger probability to $M_1$ than to the other $M_i$, we
can legitimately ask, "You favored $M_1$ as being the most likely
mass over all of the others. What basis (i.e., information) do you
have for doing that?"

The user of some other principle or function also runs the
risk of being inconsistent (Jaynes 1957, p. 623; Rietsch 1977,
p. 493). Hence minimum prejudice, plus consistency, give the
function MI a powerful claim to being the proper choice.

5. INFORMATION THEORY DENSITY DISTRIBUTION

I now present my own development of the information theory
density distribution, and afterwards compare it to Rietsch's
(1977). I will make heavy use of the methods of statistical
mechanics, particularly that of the grand canonical ensemble
(Morse 1969, pp. 316-327).

Imagine a three-dimensional Cartesian coordinate system with
its origin at the center of the earth. The grid system will
divide up the earth into many cubes of identical volume
$V = dx\,dy\,dz$, just as ordinary graph paper divides up a plane
into squares of equal area. We can approximate the spherical
surface of the earth as closely as we like by making the cubes as
small as we like. Let $\mathbf{r}_j$ be the vector from the center
of the earth to the jth cube and set $|\mathbf{r}_j| = r_j$. Let
the mass of the earth be the sum of the masses of a large number
of indistinguishable particles, each with mass m. The particles
are distributed amongst the cubes, with $n_j$ particles occupying
the jth cube. The mass $M_E$ and moment of inertia $C_E$ of the
earth are then

$$M_E = \sum_j n_j m, \qquad C_E = \frac{2}{3} \sum_j n_j m r_j^2 \tag{5.1}$$
where the subscript j runs over all of the cubes comprising the
earth. The factor (2/3) appearing in the second equation makes
use of the assumption that the density distribution is spherically
symmetric, and takes care of $r_j$ being the distance from the
center of the earth to a cube and not the distance to some axis of
rotation. This factor may be verified by taking the cubes to be so
small that we can switch from summations to integrals without
serious error, and then integrate over latitude and longitude.
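
The verification can be sketched as follows. The notation here is
an assumption made for illustration: $\theta$ is colatitude
(measured from the rotation axis) rather than latitude, $\phi$ is
longitude, $\rho(r)$ is the spherically symmetric density, and
$a_E$ is the radius of the earth. The distance of a volume element
from the axis is $r \sin\theta$, so that

```latex
\begin{aligned}
C_E &= \int_0^{a_E}\!\int_0^{\pi}\!\int_0^{2\pi}
        \rho(r)\,(r\sin\theta)^2\; r^2 \sin\theta \, d\phi\, d\theta\, dr \\
    &= 2\pi \left(\int_0^{\pi} \sin^3\theta \, d\theta\right)
        \int_0^{a_E} \rho(r)\, r^4 \, dr
     = \frac{8\pi}{3} \int_0^{a_E} \rho(r)\, r^4 \, dr , \\
\int \rho(r)\, r^2 \, dV
    &= 4\pi \int_0^{a_E} \rho(r)\, r^4 \, dr ,
\qquad\text{hence}\qquad
C_E = \frac{2}{3} \int \rho(r)\, r^2 \, dV .
\end{aligned}
```

The angular integral $\int_0^{\pi} \sin^3\theta\, d\theta = 4/3$
is what turns the factor $4\pi$ into $8\pi/3$ and so produces the
(2/3).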

Let me remark here that we have chosen cubes of equal volume 
so as to treat all regions of the earth identically. We have also 
chosen indistinguishable particles because the interchanging of 
particles leaves the density distribution unaffected. In other 
words, the only information needed to characterize the density 
distribution is to know the number of particles in each cube, and 
not to know which particular particle is in which cube. We make 
no commitment as to the values of m and V. As we shall see, they 
drop out of the final equation for the density distribution. These 
assumptions will be further discussed later on. 

Our problem is the following. A possible model for the earth
is one which has $n_1$ particles in cube 1, $n_2$ particles in cube
2, and so on. Each possible model will be given the subscript i,
so that $M_i$ and $C_i$ are the mass and moment of inertia,
respectively, for the ith model. We place no restrictions on the
number of particles allowed to occupy each cube, so there are an
infinite number of models. Our task is to assign each possible
model a probability $P_i$ that it is the correct model. Our
information will be that the expectation values of the mass
$\sum_i P_i M_i$ and moment of inertia $\sum_i P_i C_i$ are known to
be $M_E$ and $C_E$, respectively. In practice, $M_E$ and $C_E$ are
the experimentally determined values. Ultimately we will average
over all of the models and find $\bar{n}_j$, the expectation value
for the number of particles in the jth cube. From this we can
determine the "most likely" density distribution.

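The probabilities are computed according to Jaynes's
principle of minimum prejudice:

$$\frac{\partial}{\partial P_i}\left[-\sum_i P_i \ln P_i + \alpha_0 \sum_i P_i + \alpha_1 \sum_i P_i M_i + \alpha_2 \sum_i P_i C_i\right] = 0$$

giving

$$P_i = \frac{e^{\alpha_1 M_i + \alpha_2 C_i}}{Z}$$

where

$$Z = e^{1 - \alpha_0} = \sum_i e^{\alpha_1 M_i + \alpha_2 C_i} \tag{5.2}$$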


From (5.1) we can write

$$M_i = m \sum_j n_{ji}, \qquad C_i = \frac{2}{3}\, m \sum_j n_{ji} r_j^2 \tag{5.3}$$

where $n_{ji}$ is the number of particles in the jth cube according
to the ith model. The problem now looks exactly like that of the
grand canonical ensemble in statistical mechanics, with $n_{ji}$
playing the role of occupation numbers, $r_j^2$ the energy levels,
and (5.2) the grand partition function, the equations analogous to
(5.3) being

$$N = \sum_j n_j, \qquad E = \sum_j n_j E_j$$

The treatment of this problem may be found in any standard
statistical mechanics text. I will follow Morse (1969, p. 326).
Using (5.3) in (5.2) we have


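$$Z = \sum_i \exp\left(\alpha_1 \sum_j n_{ji} + \alpha_2 \sum_j n_{ji} r_j^2\right) \tag{5.4}$$

where we have redefined $\alpha_1 m$ as $\alpha_1$ and
$(2/3)\alpha_2 m$ as $\alpha_2$. Note that

$$\frac{\partial \ln Z}{\partial \alpha_1} = \frac{1}{Z} \sum_i \left(\sum_j n_{ji}\right) \exp\left(\alpha_1 \sum_j n_{ji} + \alpha_2 \sum_j n_{ji} r_j^2\right) = \sum_j \sum_i n_{ji} P_i = \sum_j \bar{n}_j \tag{5.5}$$

a result that we will make use of shortly.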

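Let us now rewrite (5.4) as a summation over the possible
values of $n_j$ instead of over i. Since there are no limits on the
possible number of particles occupying each cube, we obviously
have

$$Z = \sum_{n_1=0}^{\infty} \sum_{n_2=0}^{\infty} \cdots \exp\left[\sum_j n_j \left(\alpha_1 + \alpha_2 r_j^2\right)\right] = \prod_j Z_j \tag{5.6}$$

where

$$Z_j = \sum_{n_j=0}^{\infty} e^{n_j(\alpha_1 + \alpha_2 r_j^2)} = \frac{1}{1 - e^{\alpha_1 + \alpha_2 r_j^2}}$$

by virtue of

$$\sum_{n=0}^{\infty} x^n = \frac{1}{1 - x} \qquad (|x| < 1)$$

so that Z separates into factors for each cube.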
From (5.5) and (5.6) we have

$$\frac{\partial \ln Z}{\partial \alpha_1} = \sum_j \frac{\partial \ln Z_j}{\partial \alpha_1} = \sum_j \bar{n}_j$$

where evidently

$$\bar{n}_j = \frac{1}{e^{-\alpha_1 - \alpha_2 r_j^2} - 1} \tag{5.7}$$

Equation (5.7) is identical to the equation for the average number
of particles in an energy state, assuming the particles follow
Bose-Einstein statistics (Morse 1969, p. 326). This is hardly
surprising, since the assumptions regarding the particles are the
same: indistinguishability, plus no limits on the number of
particles occupying a given state.

Let us now make an assumption regarding (5.7) which is
well-founded in classical physics (Morse 1969, p. 325): we assume
that the cubes with volume V may be taken so small that the cubes
are sparsely occupied by the particles, thus making the average
number of particles in any given cube a small number compared to
1. This is equivalent to assuming the particles follow
Maxwell-Boltzmann statistics (Morse 1969, p. 324), and (5.7)
becomes

$$\bar{n}_j = \frac{1}{e^{-\alpha_1 - \alpha_2 r_j^2} - 1} \ll 1$$

so that

$$e^{-\alpha_1 - \alpha_2 r_j^2} \gg 1$$

and

$$\bar{n}_j \approx e^{\alpha_1 + \alpha_2 r_j^2}$$

The density distribution is obviously

$$\rho(r_j) = \frac{m \bar{n}_j}{V} = \frac{m}{V}\, e^{\alpha_1} e^{\alpha_2 r_j^2}$$
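
The Bose-Einstein average (5.7) and its sparse-occupation limit can
be checked numerically. The Python sketch below is illustrative
only: the function names and the two values of the exponent
(standing for $\alpha_1 + \alpha_2 r_j^2$, which must be negative
for the geometric series to converge) are assumptions, not values
from the text. It compares the closed form with a direct, truncated
summation over occupation numbers, and then compares the closed
form with the Maxwell-Boltzmann exponential for a sparsely occupied
cube.

```python
import math

def mean_occupation_series(exponent, n_max=2000):
    """Average of n weighted by x**n, x = exp(exponent) < 1, by direct summation."""
    x = math.exp(exponent)
    z = sum(x ** n for n in range(n_max + 1))              # truncated Z_j
    return sum(n * x ** n for n in range(n_max + 1)) / z

def mean_occupation_closed(exponent):
    """Closed form (5.7): 1 / (exp(-exponent) - 1)."""
    return 1.0 / (math.exp(-exponent) - 1.0)

# Hypothetical exponent for a moderately occupied cube.
exponent = -0.5
series_value = mean_occupation_series(exponent)
closed_value = mean_occupation_closed(exponent)

# Hypothetical exponent for a sparsely occupied cube (average << 1),
# where (5.7) approaches the Maxwell-Boltzmann form exp(exponent).
sparse_exponent = -10.0
sparse_closed = mean_occupation_closed(sparse_exponent)
sparse_boltzmann = math.exp(sparse_exponent)
```

For the sparse case the two expressions differ by a relative amount
of order exp(sparse_exponent), which is why the exponential form is
an adequate replacement when the cubes are taken small enough.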



By the assumption of spherical symmetry for the density
distribution we can drop the subscript j and write

$$\rho(r) = \rho(0)\, e^{\alpha_2 r^2} \tag{5.8}$$

which we take as the desired information theory density
distribution. The two constants $\rho(0) = (m/V)\, e^{\alpha_1}$ and
$\alpha_2$ may be found from our knowledge of the expectation value
for the mass and moment of inertia:

$$M_E = 4\pi \rho(0) \int_0^{a_E} e^{\alpha_2 r^2} r^2\, dr = 5.976 \times 10^{27}\ \mathrm{g}$$

$$C_E = \frac{8\pi}{3} \rho(0) \int_0^{a_E} e^{\alpha_2 r^2} r^4\, dr = 8.068 \times 10^{44}\ \mathrm{g\,cm^2} \tag{5.9}$$

where $a_E$ is the radius of the earth and our numerical values have
come from Stacey (1969, p. 279). We have assumed in (5.9) that
the cubes are so small that we may switch from summations to
integrals without serious error. By numerical integration of
(5.9), or from standard mathematical tables (Graber, private
communication 1978), we find that

$$\rho(r) = 12.30\, e^{-1.46\, r^2/a_E^2}\ \mathrm{g/cm^3} \tag{5.10}$$

is our "best guess" for the density distribution based on the
given data.
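
The step from the two constraints to the two constants can be
sketched in Python as follows. This is one possible numerical
route, offered as an assumption rather than as the procedure
actually used: it writes the profile as
$\rho(0) \exp(-b\, r^2/a_E^2)$ with $b = -\alpha_2 a_E^2$, uses
closed forms (error function plus integration by parts) for the two
radial integrals, and finds b by bisection on the ratio of the
integrals. The function names and the radius value 6.371e8 cm are
assumptions introduced here; the radius is not quoted in the text.
The demonstration is a round trip on synthetic inputs built from
the quoted constants, so it checks the solver rather than
reproducing the original computation.

```python
import math

def radial_integrals(b):
    """Return I2 = int_0^1 x^2 exp(-b x^2) dx and I4 = int_0^1 x^4 exp(-b x^2) dx, b > 0."""
    i0 = 0.5 * math.sqrt(math.pi / b) * math.erf(math.sqrt(b))
    i2 = (i0 - math.exp(-b)) / (2.0 * b)         # integration by parts
    i4 = (3.0 * i2 - math.exp(-b)) / (2.0 * b)   # integration by parts again
    return i2, i4

def fit_gaussian_density(mass, inertia, radius):
    """Solve the two constraints for rho(0) and b in rho(0) * exp(-b r^2 / radius^2)."""
    target = 1.5 * inertia / (mass * radius ** 2)   # must equal I4 / I2
    lo, hi = 1e-3, 50.0                             # assumed bracket for b
    for _ in range(200):                            # bisection; I4/I2 decreases with b
        mid = 0.5 * (lo + hi)
        i2, i4 = radial_integrals(mid)
        if i4 / i2 > target:
            lo = mid    # ratio too large: profile not concentrated enough
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    i2, _ = radial_integrals(b)
    rho0 = mass / (4.0 * math.pi * radius ** 3 * i2)
    return rho0, b

# Round-trip check on synthetic inputs built from the quoted profile.
a_e = 6.371e8                       # cm; assumed radius, not quoted in the text
i2, i4 = radial_integrals(1.46)
mass = 4.0 * math.pi * 12.30 * a_e ** 3 * i2
inertia = (8.0 * math.pi / 3.0) * 12.30 * a_e ** 5 * i4
rho0, b = fit_gaussian_density(mass, inertia, a_e)
surface_density = rho0 * math.exp(-b)   # g/cm^3 at r = a_e
```

Bisection is used only because the ratio of the two integrals is a
monotonic function of b; any one-dimensional root finder would
serve equally well.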

A plot of (5.10) appears in Fig. 1 (refer to page 37), along
with the "optimum" density distribution given by Bullen (1975, p.
361), which presumably gives the most plausible distribution on
the basis of all the known data. (Rietsch 1977 also compares his
curve to Bullen's.) The two curves agree remarkably well, in view
of the fact that the information theory density distribution makes
use of only two basic pieces of data: mass and moment of inertia.
No seismic or free oscillation data have been included in our
information.
6. DISCUSSION


6.1 COMPARISON WITH RIETSCH (1977) 

My analysis differs from that of Rietsch (1977) at several
points. Rietsch (1977) begins with Jaynes's (1968) proposed
generalization of Shannon's (1948) information measure (4.1) for
continuous distributions

$$MI = -\int p(\rho) \log\left[\frac{p(\rho)}{w(\rho)}\right] dv \tag{6.1}$$

where $p(\rho)$ is the probability distribution, $\rho$ is the
density distribution, and dv is the volume element for the
parameter space. The $w(\rho)$ appearing in (6.1) is the prior
probability distribution which obtains when no information is
known. Equation (6.1) differs from Shannon's (1948) own proposed
measure for continuous distributions, in which $w(\rho)$ does not
appear. I will argue in the following paragraphs that Shannon's
equation is a more useful measure than Jaynes's for continuous
distributions. However, the distinction is academic in this case
since Rietsch (1977) chooses a constant prior distribution
$w(\rho)$, which for all practical purposes makes the two measures
the same.
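
The equivalence for a constant prior can be seen in one line. As a
sketch, write the constant prior as $w_0$ (a symbol introduced here
for illustration) and use the normalization $\int p(\rho)\, dv = 1$:

```latex
MI = -\int p(\rho) \log\!\left[\frac{p(\rho)}{w_0}\right] dv
   = -\int p(\rho) \log p(\rho)\, dv \;+\; \log w_0 \int p(\rho)\, dv
   = -\int p(\rho) \log p(\rho)\, dv \;+\; \log w_0 .
```

The two measures then differ only by the additive constant
$\log w_0$, which does not depend on $p(\rho)$, so both are
maximized by the same probability distribution.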

Rietsch (1977) then takes advantage of the spherical symmetry 
of the problem and divides up the earth with spherical shells, 
with the shell radii chosen so that all of the volumes between the 
shells are equal. Later he lets the number of shells approach 
infinity to obtain a continuous density distribution. I chose 
cubes instead of shells to lay the groundwork for the general case 
where there is no spherical symmetry; in particular, for finding 


the density distribution which uses the known spherical harmonic 
coefficients of the geopotential as constraints (as explained 
in Section 6.5). So far the differences between approaches 
are minor. 

The last difference of note between Rietsch's (1977) treat- 
ment and my own is the one which produces the differing equations 
for the density distribution. Rietsch (1977) chooses to fill the 
earth's volume elements with continually varying masses, rather 
than discrete particles. It is as though his volume elements 
may be filled with a continuous liquid of any amount, rather than 
with discrete mass-points as in my own. This leads him to a den- 
sity distribution of the form (Rietsch 1977, p. 503) 

1 

p(r) = 

and from doing integrals instead of summations. His distribution 
has the same qualitative behavior as my own (5.10) and looks very 
much the same when plotted, but obviously the functional form is 
different. I chose to use discrete particles, since this is more 
in keeping with what we know about atoms, and because it more 
closely follows the traditional statistical mechanical development. 

I should also mention that Rietsch (1977) investigated a more 
general distribution by putting limits on the highest and lowest 
density allowed in each volume element, instead of letting it 
vary between zero and infinity. 
6.2 FURTHER DISCUSSION OF ASSUMPTIONS

A natural question to ask at this point is: how well do our
assumptions reflect the real earth? For instance, what about
using particles of equal mass? Where do atoms and molecules fit
in? The answer to the last question is: they don't, at this
stage. We chose this simplified model to obtain a tractable
problem and illustrate the method; these are merely the first
steps. If pressed upon this point, we can take our "particle" to
be a proton+neutron+electron. The reason for choosing this
combination is that the earth is made up predominantly of elements
with low atomic number. The nuclei of such elements very nearly
have equal numbers of protons and neutrons. (In the heavy elements,
neutrons significantly outnumber protons.) Also, electrical
neutrality prevails, so that for every proton there is an electron.
Hence we may think of the proton+neutron+electron as a naturally
occurring unit from which the earth is made. We can then pretend
that these "particles" are spread throughout the earth, and claim
to know nothing of atoms, molecules, chemical bonding, etc., which
would constitute further information. This line of argument also
takes care of any further objection to choosing indistinguishable
particles, since the previously mentioned elementary particles are
indistinguishable in the fundamental sense. However, this is going
to extremes.

The assumption that the cubes are sparsely occupied, thus
giving Maxwell-Boltzmann statistics, may also be objectionable
from an operational standpoint. After all, volumes actually
measured contain huge numbers of particles. Relaxing this
condition means that we are back to Bose-Einstein statistics and
(5.7) gives the average number of particles in a given cube. The
density is then

$$\rho(r) = \frac{m}{V}\, \frac{1}{e^{-\alpha_1 - \alpha_2 r^2} - 1} \tag{6.2}$$


With this approach we have a problem, because there are more
unknowns than constraints. If we knew how to choose m/V, then we
could find $\alpha_1$ and $\alpha_2$ from the constraints of mass and
moment of inertia, as we did before. Unfortunately, we have no
clear guidance in this matter. Even if we choose m to be the mass
of our "particle", we would still have to find V.

There is a way, however, to neatly sidestep the problem. We
introduce a third piece of information: we assume we know
$\rho(a_E)$, the value of the density at the earth's surface. Using
this information in (6.2) yields

$$\rho(r) = \frac{\left(e^{-\alpha_1} e^{-\alpha_2 a_E^2} - 1\right) \rho(a_E)}{e^{-\alpha_1} e^{-\alpha_2 r^2} - 1} \tag{6.3}$$

and we use our knowledge of $M_E$ and $C_E$ to find the two
multipliers. I will not carry through the calculation, since
according to Stacey (1969, p. 104) the surface density of rocks is
2.84 g/cm$^3$. Our Maxwell-Boltzmann equation (5.10) already gives
a surface density of 2.86 g/cm$^3$, so that (5.8) and (6.3) differ
only trivially.

We can therefore use Bose-Einstein statistics in the
information theory approach to the density distribution; but while
more general than Maxwell-Boltzmann statistics, it is more
complicated mathematically, as a comparison of (5.8) and (6.3)
shows. The Maxwell-Boltzmann case should probably be investigated
first in future developments, being simpler.

6.3 INFORMATION MEASURE FOR CONTINUOUS DISTRIBUTIONS 

Shannon (1948, p. 628) proposed as the appropriate
generalization of (4.1) for continuous distributions the function

$$MI = -K \int p(x) \ln p(x)\, dx \tag{6.4}$$

where p(x) is the probability distribution and x is a continuous
parameter. I will confine the discussion to the one dimensional
case without loss of generality.

There are three basic objections to (6.4) being the
appropriate measure (Jaynes 1963, 1968; Hobson and Cheng 1973)
which are, in order of increasing seriousness: (a) it is
dimensionally incorrect; (b) an infinity is thrown away in
deriving it; and (c) the form of the prior probability
distribution is not invariant under a change of variables. To
correct these difficulties, Hobson and Cheng (1973) and Jaynes
(1963, 1968) propose using the Kullback measure in its place, of
which Jaynes's equation (6.1) is but a special case. However,
Tribus and Rossi (1973) and Batty (1974) argue that Shannon's
original equation (6.4) is the appropriate measure.

While Rietsch (1977) follows Jaynes (1963, 1968) and Hobson
and Cheng (1973), I follow Tribus and Rossi (1973) and Batty
(1974), and will outline my reasons for doing so.

The problem of point (a) is the following. Suppose p(x) dx
represents the probability of finding a particle in an interval
dx near speed x. Then dx has dimensions cm/s and p(x) has
dimensions s/cm, so that the product of the two, p(x) dx, is
dimensionless. But the logarithm of p(x) is taken in (6.4), and
this is not allowed for dimensional quantities. So (6.4) cannot
be dimensionally correct.

The problem is easily remedied. As we shall see below when
point (b) is discussed, the difficulty develops when p(x) is
separated from dx. If we introduce a constant of value 1 and
dimensions of x, then we can write

$$p(x)\, dx = \left[D\, p(x)\right] \left[\frac{dx}{D}\right]$$

where D is the constant. Each expression in brackets on the right
side of the equation above is now dimensionless, and we can now
separate p(x) from dx without falling into error with the
logarithm. Since D has value 1, we can suppress it, its presence
being understood. We will assume this has been done in our
discussion in the following paragraphs.

As for point (b), in the continuous case $p_i$ goes over to 
p(x) dx and (4.1) becomes: 

$$
MI = \lim_{dx \to 0} \left\{ -\sum_i \left[ p(x_i)\,dx \right] \log \left[ p(x_i)\,dx \right] \right\}
   = -\int p(x) \log \left[ p(x) \right] dx - \int p(x) \log(dx)\,dx
$$

where K has been set equal to 1. Obviously as dx → 0, 
log(dx) → -∞ and the right side approaches infinity. At this 
point the log(dx) term is subtracted off, leaving a well-behaved 
function which is just (6.4). However, subtracting infinity from 
infinity and obtaining a finite number is usually unsound 
mathematically. 
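
The divergence described here can be illustrated numerically. 
The sketch below is not from the original paper: it assumes a 
hypothetical test density p(x) = 2x on the interval 0 to 1, 
midpoint cells of width dx, natural logarithms, and K = 1. The 
names discrete_mi and triangular are illustrative only. 

```python
import math

def discrete_mi(pdf, lo, hi, n_bins):
    """Return (-sum_i w_i log w_i, dx) with w_i = p(x_i) dx on midpoint cells."""
    dx = (hi - lo) / n_bins
    total = 0.0
    for i in range(n_bins):
        w = pdf(lo + (i + 0.5) * dx) * dx  # probability assigned to cell i
        if w > 0.0:
            total -= w * math.log(w)
    return total, dx

def triangular(x):
    """Hypothetical test density on [0, 1]: p(x) = 2x."""
    return 2.0 * x

# Closed form of -integral p log p dx for p(x) = 2x on [0, 1].
continuous_part = 0.5 - math.log(2.0)

for n in (10, 100, 1000, 10000):
    mi, dx = discrete_mi(triangular, 0.0, 1.0, n)
    # First column grows without bound; adding log(dx) leaves the finite part.
    print(n, round(mi, 4), round(mi + math.log(dx), 4))
```

In this sketch the discrete measure keeps growing as the cells 
shrink, while the measure plus log(dx) settles toward the finite 
integral, which is the term retained in (6.4). 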

This can be taken care of by going back to (4.1) and writing 
it in exponential form 

$$ e^{-MI} = \prod_i p_i^{\,p_i} $$

Then in the continuous case it goes over to 

$$ e^{-MI_1} = \prod_i \left[ p(x_i)\,dx \right]^{p(x_i)\,dx} $$

which can be written 

$$
e^{-MI_1} = \prod_i \left[ p(x_i)^{p(x_i)\,dx} \right] \left[ dx^{\,p(x_i)\,dx} \right]
          = \left[ \prod_i p(x_i)^{p(x_i)\,dx} \right] \cdot dx
$$

by noting that 

$$ \sum_i p(x_i)\,dx = 1 $$

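When more information becomes available the probability 
distribution p(x) goes over to some new one q(x), whose 
associated measure is 

$$ e^{-MI_2} = \left[ \prod_i q(x_i)^{q(x_i)\,dx} \right] \cdot dx $$

Dividing one by the other and letting dx approach 0 gives 

$$
\frac{e^{MI_2}}{e^{MI_1}}
  = \frac{\left[ \prod_i p(x_i)^{p(x_i)\,dx} \right] dx}
         {\left[ \prod_i q(x_i)^{q(x_i)\,dx} \right] dx}
$$

The dx's cancel, giving 

$$
\log \frac{e^{MI_2}}{e^{MI_1}} = MI_2 - MI_1
  = -\int q(x) \log q(x)\,dx + \int p(x) \log p(x)\,dx
$$
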

which is well behaved. So we may as well write (6.4) as the MI 
for continuous distributions, since we know now that the infinity 
associated with the old distribution is the same as for the new 
distribution, and subtraction of the two leads to no difficulties. 
We are therefore concerned with changes in the amount of 
information, and not with the amount itself. 
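
A small numerical check of this cancellation is sketched below. 
It is an illustration under stated assumptions rather than part 
of the original argument: the old distribution is taken to be 
uniform on 0 to 1, the new one is a hypothetical q(x) = 2x, and 
both are cut into the same midpoint cells. The function names 
are illustrative. 

```python
import math

def binned_mi(pdf, n_bins):
    """Discrete MI, -sum_i w_i log w_i, for a density on [0, 1] in n_bins cells."""
    dx = 1.0 / n_bins
    weights = [pdf((i + 0.5) * dx) * dx for i in range(n_bins)]
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def p_old(x):
    return 1.0        # old distribution: uniform on [0, 1]

def q_new(x):
    return 2.0 * x    # hypothetical new distribution

def mi_change(n_bins):
    """MI_2 - MI_1 computed with the same cell width for both distributions."""
    return binned_mi(q_new, n_bins) - binned_mi(p_old, n_bins)

# Difference of the two continuous integrals: (1/2 - log 2) - 0.
expected = 0.5 - math.log(2.0)

for n in (10, 100, 1000, 10000):
    print(n, round(binned_mi(p_old, n), 4), round(mi_change(n), 4))
```

Each individual measure grows like log(n) in this sketch, but the 
difference stays finite and approaches the difference of the two 
integrals. 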

The force of (c) may be seen from the following example. 
Suppose once again that p(x) dx represents the probability of 
finding a particle in the neighborhood dx near speed x, and 
further, that the speed is definitely known to lie between values 
$x_1$ and $x_2$. We have no further information. If we apply 
Jaynes's principle of minimum prejudice, we find 

$$ \frac{\partial}{\partial p} \left[ -\int p(x) \log p(x)\,dx + \alpha \int p(x)\,dx \right] = 0 $$

so that 

$$ -\log p(x) - 1 + \alpha = 0\,; \qquad p(x) = e^{\alpha - 1} = \frac{1}{x_2 - x_1} $$

so that p(x) is a constant for $x_1 \le x \le x_2$ and zero 
outside the interval. But the kinetic energy $mx^2/2$ (where m is 
the mass of the particle) is a perfectly respectable physical 
quantity. Why not take it as the continuous parameter? If we do 
so by setting $y = mx^2/2$ and apply Jaynes's principle once 
again we find: 

$$ \frac{\partial}{\partial s} \left[ -\int s(y) \log s(y)\,dy + \alpha \int s(y)\,dy \right] = 0 $$

which implies 

$$ s(y) = \frac{1}{y_2 - y_1} \qquad \text{for } y_1 < y < y_2 $$


The two distributions are inconsistent: a constant distribution 
for speed implies a nonconstant distribution for energy and vice 
versa, as may be seen from 

$$ s(y)\,dy = s\!\left( \frac{mx^2}{2} \right) \frac{m}{2}\,2x\,dx = p(x)\,dx $$


Hence there is no one prior probability distribution. We can 
make it what we please for any given parameter by a suitable 
change of variables. To escape this difficulty, Jaynes (1968) 
proposes to find the prior distribution w(p) in (6.1) via the 
theory of groups. This matter is also discussed by Rietsch (1977, 
pp. 494-495) and by Rowlinson (1970). 
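
The inconsistency can be made concrete with a short sketch. The 
mass and speed limits below are hypothetical values chosen for 
illustration, and the function names are not from the original 
text. The code compares the energy density implied by a constant 
speed density, using s(y) dy = p(x) dx, with the constant energy 
density obtained by treating energy as the parameter. 

```python
import math

M = 2.0                 # particle mass in g (hypothetical value)
X1, X2 = 10.0, 100.0    # speed limits in cm/s (hypothetical values)
Y1, Y2 = 0.5 * M * X1**2, 0.5 * M * X2**2   # matching energy limits in ergs

def p_speed(x):
    """Density that is constant in speed on [X1, X2]."""
    return 1.0 / (X2 - X1) if X1 <= x <= X2 else 0.0

def s_from_p(y):
    """Energy density implied by p_speed via s(y) dy = p(x) dx, dy = M x dx."""
    x = math.sqrt(2.0 * y / M)
    return p_speed(x) / (M * x)

def s_energy(y):
    """Density that is constant in energy on [Y1, Y2]."""
    return 1.0 / (Y2 - Y1) if Y1 <= y <= Y2 else 0.0

for y in (Y1, 0.5 * (Y1 + Y2), Y2):
    print(y, s_from_p(y), s_energy(y))
```

With these assumed numbers the implied energy density falls off 
as 1/x across the interval, whereas the density found directly in 
energy is flat, so the two cannot both hold. 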

I must agree with Tribus and Rossi (1973) that there really 
is no problem; for to change variables is to ask a different 
question. However, I am not entirely certain that their reasons 
for believing so are the same as mine, due to the terseness of 
their discussion. Therefore I will give my reasons below. 

Let us start by recalling the meaning of (4.1). It is the 
amount of information needed to single out the correct answer 
from N possible answers. Now as $p_i$ goes over to p(x) dx in the 
continuous case, the interpretation of (6.4) must be that it 
determines the amount of information needed to trap the correct 
answer in any one of the small intervals dx. Likewise, if we had 
chosen some other variable y = f(x) as the continuous parameter, 
then MI answers the question of how much information is needed 
to trap the answer in any one of the intervals dy. So when we 
pick x or y as the parameter, we are asking different questions 
about the problem. There is no one prior distribution. The 
amount of information needed to answer a question is a function 
of the question asked. 

Let us clarify the situation with our example in which x is 
speed and y is kinetic energy. Suppose that the mass of the 
particle is 2 g and the speed is definitely known to lie between 
0 and 100 cm/s. Take dx to be 1 cm/s. The energy obviously lies 
between 0 and $10^4$ ergs. Take dy to be 1 erg when y is the 
prior variable. 

Now to use x or to use y as the parameter is to ask two 
different questions. Trapping the particle speed to within 1 cm/s 
is not the same as trapping the energy to within 1 erg. If x is 
near 50 cm/s, for example, then dy = 2x dx, and trapping the 
speed to within 1 cm/s means we know the energy to within about 
2 × 50 × 1 = 100 ergs, and not 1 erg. So it is a question of 
resolution. Equal resolution along the speed axis implies unequal 
resolution along the energy axis and vice versa. Hence it is 
meaningless to ask, what is the prior probability distribution? 
That depends on the question you are asking. 
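
The arithmetic of this example can be laid out as a short 
sketch. The mass, speed range, and cell widths are those of the 
example. The function names, and the use of log N for N equally 
likely cells (from (4.1) with K = 1), are illustrative choices. 

```python
import math

MASS = 2.0      # g, as in the example
X_MAX = 100.0   # cm/s, upper speed limit
DX = 1.0        # cm/s, resolution along the speed axis
DY = 1.0        # erg, resolution along the energy axis

def energy(x):
    """Kinetic energy in ergs for speed x in cm/s."""
    return 0.5 * MASS * x * x

def energy_width_of_speed_cell(x_center, dx=DX):
    """Width in ergs of the energy interval spanned by one speed cell."""
    return energy(x_center + dx / 2) - energy(x_center - dx / 2)

n_speed_cells = round(X_MAX / DX)            # cells along the speed axis
n_energy_cells = round(energy(X_MAX) / DY)   # cells along the energy axis

# Information needed to single out one of N equally likely cells is log N.
info_speed = math.log(n_speed_cells)
info_energy = math.log(n_energy_cells)

print(energy_width_of_speed_cell(50.0), n_speed_cells, n_energy_cells)
print(info_speed, info_energy)
```

The two questions call for different amounts of information 
because they divide the same physical range into different 
numbers of cells. 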

This argument is bolstered by noting an unsatisfactory aspect 
of (6.1) pointed out by Tribus and Rossi (1973): the information 
needed to single out the correct answer depends on the order in 
which the information is given. That it not depend on the order 
given is an essential part of deriving (4.1) (Baierlein 1971, 
pp. 64-74), and Shannon's (1948) equation is true to this 
condition (Tribus and Rossi 1973) when generalized to continuous 
distributions. 

Theories are neither right nor wrong; they have only varying 
degrees of usefulness (Tribus 1966, p. 207). What I am suggesting 
here is that Shannon's (1948) original equation (6.4) is more 
useful than Jaynes's (1963, 1968) alternative equation when 
talking about information and continuous distributions. 

6.4 CRITICISMS OF INFORMATION THEORY INFERENCE 

Some criticisms of information theory inference which have 
been raised will be briefly discussed here, starting with the 
coin flip problem. 

Rowlinson (1970) argues that information theory is unable to 
deal with certain kinds of information. Suppose, for example, 
that we flip a coin 100 times and it comes up heads 75 times. 
Clearly we have some relevant information on whether the next 
flip will be heads or tails. Rowlinson (1970) claims that 
information theory cannot handle this problem. 

Tribus (1969), an ardent proponent of information theory 
inference, would probably answer this challenge according to 
the algorithm given on page 120 of his book: assign probabilities 
according to Jaynes's principle, and then modify the probabilities 
using Bayes's theorem when new information becomes available. In 
the coin flip problem the original information would be that two 
outcomes are possible, giving probability 1/2 each to heads and 
tails. The new information would be the 75 heads out of 100 
flips. This would be used in Bayes's theorem to give the new 
probabilities. 
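
One way such an update might be carried out is sketched below. 
This is an assumption on my part and not necessarily the 
algorithm in Tribus's book: it treats the probability of heads as 
an unknown with a uniform prior (a Beta(1, 1) distribution), for 
which Bayes's theorem gives the standard rule of succession. The 
function name predictive_heads is hypothetical. 

```python
from fractions import Fraction

def predictive_heads(heads, flips, prior_heads=1, prior_tails=1):
    """P(next flip is heads) under a Beta(prior_heads, prior_tails) prior.

    With the default uniform prior this reduces to (heads + 1) / (flips + 2).
    """
    return Fraction(heads + prior_heads, flips + prior_heads + prior_tails)

before = predictive_heads(0, 0)      # only "two outcomes are possible" is known
after = predictive_heads(75, 100)    # after observing 75 heads in 100 flips

print(before, after, float(after))
```

Under this assumed prior the probability of heads starts at 1/2 
and moves toward the observed frequency once the flips are taken 
into account. 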

I will not pursue this problem further here. 

Another apparent drawback is that information theory 
inference at times gives "unphysical" answers. For instance, if 
we omitted our knowledge of the moment of inertia in solving for 
the density distribution, then we would obtain a constant density 
throughout the earth, as may be easily verified. This does 
not seem reasonable; we feel that the density should certainly 
increase towards the center of the planet. The value of 
information theory inference appears questionable in this 
instance. 

The problem is easily resolved, as may be seen in the 
following example. Suppose that instead of guessing the earth's 
density distribution we were confronted with a small object of 
exotic shape and unknown composition and asked to guess its 
density distribution on the basis of known mass and volume. In 
this case a constant density distribution does not appear at all 
unreasonable; this is because we are in a state of extreme 
ignorance about the object. With the earth, however, this is not 
the case: we have some ideas about how the earth ought to behave. 
In this instance it is that gravity should pull the heavier 
material towards the center of the earth, and high interior 
pressures will compress it, making the density increase towards 
the center. Hence we are dealing with tacit information. We can 
hardly withhold information from the method and then criticize it 
for not reproducing what we did not tell it! So if an answer 
appears "unphysical", then we have not been fair to the method; 
we did not tell it everything we knew. 


6.5 FUTURE DIRECTIONS FOR THE THEORY 


The results obtained above may be easily generalized to 
include any known volume integrals of the density distribution. 
Supposing that there are L such integrals having the form 

$$ \int_{\text{volume of earth}} \rho(\vec{r})\, f_i(\vec{r})\, dv = F_i \qquad (i = 1, \ldots, L) $$

the resulting average density distribution is 

$$ \rho(\vec{r}) = \text{const} \cdot \exp\left( \alpha_1 f_1(\vec{r}) + \alpha_2 f_2(\vec{r}) + \cdots + \alpha_L f_L(\vec{r}) \right) $$

The Lagrange multipliers $\alpha_i$ are to be found from the 
known values $F_i$. 

Note that the above result is not restricted to the 
spherically symmetric case. Besides the mass and moment of 
inertia, the spherical harmonic coefficients $C_{lm}$ and 
$S_{lm}$ of the earth's gravitational field immediately come to 
mind as integrals having this form. I intend to publish the 
resulting $\rho(\vec{r})$ based on the gravity field coefficients 
in the near future. 
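
For the spherically symmetric case with only mass and moment of 
inertia as constraints, finding the multipliers can be sketched 
numerically as follows. This is an illustration under stated 
assumptions, not the author's procedure: the density is taken as 
rho0 exp(-b r^2 / a^2), the mean density (5.515 g/cm^3) and the 
ratio I/(M a^2) (0.3307) are assumed input values not given in 
this section, and Simpson's rule with bisection is simply one 
convenient way to do the computation. 

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def radial_moment(power, b):
    """Integral of x**power * exp(-b x**2) for 0 <= x <= 1, with x = r / a."""
    return simpson(lambda x: x**power * math.exp(-b * x * x), 0.0, 1.0)

def inertia_ratio(b):
    """I / (M a^2) for rho(r) = rho0 exp(-b r^2 / a^2) in a sphere of radius a."""
    return (2.0 / 3.0) * radial_moment(4, b) / radial_moment(2, b)

def solve_multiplier(target, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection; inertia_ratio falls from 0.4 at b = 0 as b increases."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if inertia_ratio(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

MEAN_DENSITY = 5.515    # g/cm^3, assumed input value
INERTIA_RATIO = 0.3307  # I / (M a^2), assumed input value

b = solve_multiplier(INERTIA_RATIO)
# Mean density equals 3 * rho0 * integral of x^2 exp(-b x^2) over 0..1.
rho0 = MEAN_DENSITY / (3.0 * radial_moment(2, b))

print(b, rho0)
```

With these assumed inputs the sketch lands close to the 
coefficients 1.46 and 12.30 g/cm^3 quoted in the abstract. Adding 
further integral constraints would add one multiplier, and one 
equation to solve, per constraint. 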

The next obvious extension of the theory is to assume that 
the earth is an elastic body so as to include the elastic 
parameters $\mu(\vec{r})$ and $\lambda(\vec{r})$ in addition to 
the density distribution $\rho(\vec{r})$ as unknown quantities to 
estimate. This will allow seismic travel times, free oscillation 
periods, and body tide observations to be used, all of which 
depend upon $\mu(\vec{r})$, $\lambda(\vec{r})$ and 
$\rho(\vec{r})$. Graber (1977) has already made a start in this 
direction using mass, moment of inertia, and three zero-node 
torsional normal modes of degree l = 2, 8, and 26 of the earth. 
More realistic treatment of atoms and molecules has already been 
mentioned. Information theory inference should be compared to 
other inverse techniques, such as the Backus-Gilbert method. Gull 
and Daniell (1978) briefly discuss the two methods. The goal of 
information theory inference is to put in all of the physics and 
data we know about the earth and maximize the remaining missing 
information. 

Since there will never come a day when we have all of the 
information, solid earth geophysics will always have a need for 
sound methods of inference. Information theory is such a method. 
Its philosophical basis is satisfying: no unwarranted weighting 
of possible answers. It is rational and objective: everyone 
using it will obtain the same answers, given the same data (once 
the formulation of the problem is agreed upon!); it gives the 
"best" answer on the basis of very little data; it provides an 
alternative to extensive modeling; and its mathematics is 
standard, being that of statistical mechanics. Information theory 
inference should find extensive use in solid earth geophysics. 

Figure 1. The information theory density distribution using 
Maxwell-Boltzmann statistics (curve A) and the optimum density 
distribution of Bullen (1975) (curve B) are shown as a function 
of radial distance r. 